Standard video frame interpolation methods first estimate optical flow between input frames and then synthesize an intermediate frame guided by motion. Recent approaches merge these two steps into a single convolution process by convolving input frames with spatially adaptive kernels that account for motion and re-sampling simultaneously. These methods require large kernels to handle large motion, which limits the number of pixels whose kernels can be estimated at once due to the large memory demand. To address this problem, this paper formulates frame interpolation as local separable convolution over input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. Our method develops a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. Since our method is able to estimate kernels and synthesize the whole video frame at once, it allows for the incorporation of a perceptual loss to train the neural network to produce visually pleasing frames. This deep neural network is trained end-to-end using widely available video data without any human annotation. Both qualitative and quantitative experiments show that our method provides a practical solution to high-quality video frame interpolation.
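To make the separable formulation concrete, the following NumPy sketch shows how a single output pixel could be synthesized from two input patches using pairs of 1D kernels. This is an illustration under assumed conventions, not the paper's implementation: the function name, the grayscale patches, and the kernel size are all hypothetical.

```python
import numpy as np

def interpolate_pixel(patch1, patch2, kv1, kh1, kv2, kh2):
    """Synthesize one output pixel from two n x n grayscale patches.

    patch1, patch2: n x n patches centered on the output pixel in the
    two input frames. kv*/kh* are the estimated vertical/horizontal 1D
    kernels (length n) for this pixel. Each 2D kernel is formed as the
    outer product of a 1D pair, so only 2n coefficients per frame are
    estimated instead of n^2.
    """
    k1 = np.outer(kv1, kh1)  # n x n kernel from a vertical/horizontal 1D pair
    k2 = np.outer(kv2, kh2)
    return np.sum(patch1 * k1) + np.sum(patch2 * k2)

if __name__ == "__main__":
    n = 51  # illustrative kernel size; large kernels cover large motion
    rng = np.random.default_rng(0)
    p1, p2 = rng.random((n, n)), rng.random((n, n))
    kv1, kh1, kv2, kh2 = (rng.random(n) / n for _ in range(4))
    print(interpolate_pixel(p1, p2, kv1, kh1, kv2, kh2))
```

The parameter saving follows directly: per output pixel, four 1D kernels amount to 4n values, whereas two full 2D kernels would require 2n² values; for an illustrative n = 51 that is 204 versus 5202.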